Committee Members
David V. Anderson, Ph.D. (advisor)
Justin Romberg, Ph.D.
Matthieu Bloch, Ph.D.
Larry Heck, Ph.D.
Mikle South, Ph.D.
Georgia Institute of Technology PhD Dissertation Defense
Machine Learning
Monday, 4 December 2023
You may access this presentation at https://nicolasshu.com/thesis_defense
Real-Time Platform
Speaker Identification
Sensor Localization
Intro
System Infrastructure
Line of Best Fit
You have all the data from the beginning
You receive the data one piece at a time
You can have all the data but load one piece of data at a time
Real-Time data
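A minimal sketch of the difference between the two settings, assuming an ordinary least-squares line. The class name `OnlineLineFit` and the sample points are illustrative only and do not come from the thesis. The online version keeps running sums, so each new point updates the fit without storing the past data.

```python
class OnlineLineFit:
    """Least-squares line y = slope * x + intercept, updated one point at a time."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        # Only running sums are stored; the raw points are discarded.
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two points with distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept


def batch_line_fit(xs, ys):
    """Same fit, computed with all the data available from the beginning."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data points, used only to compare the two approaches.
points = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]

fit = OnlineLineFit()
for x, y in points:
    fit.update(x, y)  # one piece of data at a time

online = fit.coefficients()
batch = batch_line_fit([p[0] for p in points], [p[1] for p in points])
```

Both approaches give the same line once all points have been seen; the online one can also report an estimate at any intermediate step.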
- I want to be Barbie and Ken for Halloween
- Oh yeah?
- I'll be Barbie and you'll be Ken
- We should get rid of tipping culture. Waiters and waitresses deserve fair livable wages.
Max Braverman
Autism Spectrum Disorder
What are some solutions?
Personally Check in
Hire a Care Giver
Surveillance
https://www.peoplemanagement.co.uk/article/1747153/one-in-seven-workers-say-employer-monitoring-has-increased-during-covid
Audio is capable of capturing
Different modalities are capable of capturing different information
We need a system that
1. Can identify new incoming speakers and re-identify them
2. Can operate in real time as an online algorithm
Where to put sensors?
System Infrastructure
Speaker Identification Engine
Real-Time
User Interface
Sensor Localization
Convex, Simply connected
Non-Convex, Simply connected
Non-Convex, Non-Simply connected
Non-simply Connected Environment
Problems
- The data for each room is in the LiDAR's local coordinates (i.e. every scan is centered at 0)
- Going from room to room, the orientation may change
We needed to manipulate the data quickly, but no GUI was found
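The manipulation described above amounts to a rigid transform per room: rotate the scan to fix the orientation, then translate it away from the origin. A minimal 2D sketch follows; the function name `to_global`, the angle, and the offset are hypothetical values chosen for illustration, not the thesis tooling.

```python
import math


def to_global(points, angle_rad, offset):
    """Rotate local (x, y) points by angle_rad, then translate them by offset."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    ox, oy = offset
    return [(c * x - s * y + ox, s * x + c * y + oy) for x, y in points]


# A hypothetical room scan in LiDAR-local coordinates (centered at 0).
room_scan = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

# Place the room in a shared frame: turned 90 degrees, shifted to (5, 3).
placed = to_global(room_scan, math.pi / 2, (5.0, 3.0))
```

Applying one such transform per room brings all scans into a single home-wide coordinate frame.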
Home Mapping
which can be applied to non-simply connected environments
A non-simply connected environment
After a lot of adjustments, we created an algorithm which would allow for a proper visible tessellation
... but it doesn't work well in non-simply connected spaces
Plus, it is dependent on good initial conditions
1. Tools to Quickly Map Environment
2. Networked Control to Maximize Coverage
Home Mapping
Maximum Coverage
System Infrastructure
No recording should ever leave the environment
Socket Programming!
audiosockets
our Python package
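As a rough illustration of the socket programming involved, the sketch below sends one audio chunk with a length prefix so the receiver knows where the message ends. It uses only the Python standard library and is not the audiosockets API; the helper names `send_chunk` and `recv_chunk` are hypothetical, and `socket.socketpair()` stands in for a real network connection between a recorder and the server.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def send_chunk(sock, payload):
    """Send one message: a length header followed by the raw bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed before the full message arrived")
        buf.extend(part)
    return bytes(buf)


def recv_chunk(sock):
    """Receive one message produced by send_chunk."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


# A connected pair of local sockets stands in for recorder and server.
recorder, server = socket.socketpair()
with recorder, server:
    audio = bytes(range(256)) * 2  # placeholder bytes, not real audio samples
    send_chunk(recorder, audio)
    received = recv_chunk(server)
```

In a deployment the pair would be replaced by a listening server socket and a client connection on the local network, which keeps recordings inside the environment.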
Diagram: Raspberry Pi and Server, connected through audiosockets
Diagram: Recorder and Processor pairs sending to a Server, followed by several Recorders sharing one Processor and Server through audiosockets
System Infrastructure
1. The system is capable of communicating over the network
2. The architecture merges different processes, reducing the computational resources required
3. The system is scalable
Speaker Identification
audio: Is this a new speaker?
- yes: Enroll / Register new speaker
- no: Identify speaker
Probabilistic Graphical Models
Support Vector Machines
Neural Networks
Decision Trees
But this requires a lot of data!
Traditional Speaker Identification: learn how to do a task well
Meta-Learning: learn how to learn tasks well
Figure: x-vector network. Input frames 1, 2, 3, ..., T; Layers 1 to 8; Time-Delay Neural Network with frame contexts spanning t-2 to t+2 and t-3 to t+3; Stats Pooling; DNN; x-vector
Randomly choose classes:
Support Set: used to create prototypes (i.e. centroids)
Query Set: used for training
Figure: the x-vector network with Layer 7, Layer 8, and the DNN removed, leaving the Input, Layers 1 to 6, the Time-Delay Neural Network, and Stats Pooling to produce the x-vector
Euclidean Distance
Assumption:
The latent subspace creates features which have Gaussian-like characteristics
Show formula
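One common way to write the prototypical loss with Euclidean distance is sketched below: each class prototype is the centroid of its support embeddings, and each query is scored with a softmax over negative squared distances to the prototypes. This follows the standard prototypical-network formulation and is an assumption about the exact form used here; the speaker names and 2-dimensional embeddings are made up for illustration.

```python
import math


def prototypes(support):
    """support: dict label -> list of embeddings. Returns label -> centroid."""
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in support.items()
    }


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypical_loss(support, queries):
    """queries: list of (embedding, true_label). Returns mean negative log-probability."""
    protos = prototypes(support)
    total = 0.0
    for embedding, label in queries:
        # Logit for each class is the negative squared distance to its prototype.
        logits = {k: -squared_euclidean(embedding, p) for k, p in protos.items()}
        top = max(logits.values())  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += log_norm - logits[label]
    return total / len(queries)


# Hypothetical 2-dimensional embeddings for two speakers.
support_set = {"spk_a": [[0.0, 0.0], [0.0, 2.0]], "spk_b": [[4.0, 4.0], [6.0, 4.0]]}
query_set = [([0.0, 1.0], "spk_a"), ([5.0, 4.0], "spk_b")]
swapped = [([0.0, 1.0], "spk_b"), ([5.0, 4.0], "spk_a")]

protos = prototypes(support_set)
good_loss = prototypical_loss(support_set, query_set)
bad_loss = prototypical_loss(support_set, swapped)
```

Queries that sit on their own class prototype give a loss near zero, while mislabeled queries give a large loss, which is the signal that pulls embeddings of one speaker together during training.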
Number of Samples in Support/Query Sets
Number of Classes/Speakers
x-vector dimension: 512-dim, 128-dim, 16-dim
Diagram: X-Vector System, x-vectors, prototypical loss
Diagram (data split): speakers with < 41 mins; speakers with > 41 mins, split 80% / 10% / 10%; X-Vector System; Stats for Gaussians; seen and unseen speakers; Compute F1 scores
Detection of New Classes
1. Found that the x-vector dimension can be reduced to 32-dim
2. Created a method to detect new classes based on few-shot learning clustering
Adaptive Few-Shot Speaker ID
2D Data
3D Data
What about 4 dimensions? 6 dimensions? 32 dimensions?
Our desired x-vectors have 32 dimensions!
We can use t-SNE to check qualitatively whether the x-vectors form distinct clusters
x-vectors: Is this a new speaker?
- yes: Enroll / Register new speaker
- no: Identify speaker
This setup has many caveats!
Problem 1: The system does not know the actual labels as it creates predicted labels
Solution: Matching with the Hungarian Algorithm
Figure: Hungarian matching example with nine costs: $10, $40, $50, $50, $70, $60, $50, $80, $80
There will be leftover classes when using the Hungarian Algorithm
Greedy algorithms will use up every predicted class found!
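The contrast between the two matchings can be sketched on a toy labeling. The one-to-one matching below is found by brute force over permutations, which reaches the same optimum as the Hungarian algorithm but is only practical for a handful of labels (for larger problems, `scipy.optimize.linear_sum_assignment` is an option). The greedy rule shown, where each predicted class takes the true class it overlaps with most, is an assumption about how the greedy matching is defined; the labels are made up.

```python
from collections import Counter
from itertools import permutations


def overlap_counts(true_labels, pred_labels):
    """Count how often each (predicted, true) label pair occurs together."""
    return Counter(zip(pred_labels, true_labels))


def one_to_one_matching(true_labels, pred_labels):
    """Best one-to-one map predicted -> true, maximizing total overlap."""
    counts = overlap_counts(true_labels, pred_labels)
    preds = sorted(set(pred_labels))
    trues = sorted(set(true_labels))
    if len(preds) >= len(trues):
        candidates = (dict(zip(chosen, trues)) for chosen in permutations(preds, len(trues)))
    else:
        candidates = (dict(zip(preds, chosen)) for chosen in permutations(trues, len(preds)))
    return max(candidates, key=lambda m: sum(counts[(p, t)] for p, t in m.items()))


def greedy_matching(true_labels, pred_labels):
    """Map every predicted class to the true class it overlaps with most."""
    counts = overlap_counts(true_labels, pred_labels)
    trues = sorted(set(true_labels))
    return {p: max(trues, key=lambda t: counts[(p, t)]) for p in sorted(set(pred_labels))}


# Speaker A was split into two predicted classes (x and y): a segmented result.
true_seq = ["A"] * 4 + ["B"] * 4
pred_seq = ["x", "x", "y", "y", "z", "z", "z", "z"]

hungarian_map = one_to_one_matching(true_seq, pred_seq)
greedy_map = greedy_matching(true_seq, pred_seq)
```

Here the one-to-one matching leaves predicted class `y` unmatched, while the greedy matching assigns both `x` and `y` to speaker A, so it hides the segmentation.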
Figure: predicted labels A through L
This is VERY segmented!
Figure: the same segmented predicted labels (A through L), matched with the Hungarian Algorithm and then with the Greedy Algorithm
Figure: predicted labels with low segmentation (A, B, C), matched with the Hungarian Algorithm & Greedy Algorithm
High Segmentation: near perfect Greedy Matching, low Hungarian Matching
Low Segmentation: many mismatches in Greedy Matching, many mismatches in Hungarian Matching
Figure: Speaker vs. Time (s), comparing the true label with the predicted label
Mahalanobis Classifier
Diagram: x-vectors enter the xvec queue; compute joint Maha. dists to closest cluster; a true/false decision on the joint Maha. dist to the closest cluster leads either to creating a new class (classify as new cluster) or to classifying as closest cluster; the output is the Class
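A minimal sketch of such a decision rule, under several assumptions that are not taken from the thesis: the covariance is diagonal, the "joint" distance of a queue is the root of the mean squared Mahalanobis distance over its x-vectors, a fixed `threshold` separates known from new speakers, and a new cluster starts with unit variances. All names are hypothetical.

```python
import math


def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance (per-dimension variances)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def joint_distance(queue, mean, var):
    """Assumed joint distance: root of the mean squared distance over the queue."""
    return math.sqrt(sum(mahalanobis_sq(x, mean, var) for x in queue) / len(queue))


def classify(queue, clusters, threshold):
    """clusters: dict name -> (mean, var). Returns (label, is_new) and may add a cluster."""
    if clusters:
        name, dist = min(
            ((k, joint_distance(queue, m, v)) for k, (m, v) in clusters.items()),
            key=lambda item: item[1],
        )
        if dist <= threshold:
            return name, False  # classify as closest cluster
    # Too far from every known cluster: create a new class.
    new_name = f"speaker_{len(clusters)}"
    mean = [sum(dim) / len(queue) for dim in zip(*queue)]
    clusters[new_name] = (mean, [1.0] * len(mean))  # placeholder unit variances
    return new_name, True


clusters = {}
first = classify([[0.0, 0.0], [0.2, -0.2]], clusters, threshold=3.0)
second = classify([[0.1, 0.0]], clusters, threshold=3.0)
third = classify([[10.0, 10.0]], clusters, threshold=3.0)
```

The first queue enrolls a speaker, the second is close enough to be re-identified, and the third is far enough to be registered as a new speaker.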
Figure: Baseline results under Hungarian and Greedy matching
Remember how in VoxCeleb1, we had 29 speakers with more than 41 mins of audio?
Diagram: the 29 speakers with > 41 mins, split 80% / 10% / 10% into seen and unseen data, with Stats for Gaussians
Experiment 2: Using 41min Covariance as Model Cov
Diagram: the Mahalanobis Classifier flowchart, alongside the 80% split of the 29 speakers with > 41 mins
Figure: Baseline vs. Using 41min Covariance, under Hungarian and Greedy matching
Experiment 3: 5s Initial Covariance Adaptation
Diagram: the Mahalanobis Classifier flowchart with a Covariance Adaptation stage: collect the x-vectors from the xvec queue into a Collection, then train a covariance matrix on the Collection, which yields the Trained Covariance
Figure: Baseline vs. 5s Initial Covariance Adaptation, under Hungarian and Greedy matching
Sample Mean at time T
Sample Covariance at time T
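One standard way to keep the sample mean and sample covariance current as each new x-vector arrives is a Welford-style update, sketched below. Whether the thesis uses exactly this recursion is an assumption; the class name `RunningStats` and the data are illustrative.

```python
class RunningStats:
    """Sample mean and covariance updated one vector at a time (Welford-style)."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        # Running sum of outer products of deviations.
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.count += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]  # deviation from old mean
        self.mean = [mi + di / self.count for mi, di in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]  # deviation from new mean
        for i, di in enumerate(delta_old):
            for j, dj in enumerate(delta_new):
                self.m2[i][j] += di * dj

    def covariance(self):
        if self.count < 2:
            raise ValueError("need at least two samples for a sample covariance")
        return [[v / (self.count - 1) for v in row] for row in self.m2]


# Hypothetical 2-dimensional vectors arriving over time.
stats = RunningStats(dim=2)
for vector in [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]:
    stats.update(vector)

mean_T = stats.mean
cov_T = stats.covariance()
```

No past vectors are stored, so the cost of each update depends only on the dimension, not on how much audio has been seen.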
Experiment 4: Algorithmic Stats
Diagram: the Mahalanobis Classifier flowchart with an added Update step on the Class
Figure: Baseline vs. Algorithmic Statistics, under Hungarian and Greedy matching
Speaker Identification
Sensor Localization
Intro
System Infrastructure
x-vectors
joint Maha. dist
to closest cluster
true
false
xvec queue
xvec queue
Create new
class
Classify as new cluster
Classify as closest cluster
Compute joint Maha. dists to closest cluster
Mahalanobis Classifier
Class
Class
Update
Experiment 5: 5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
x-vectors
joint Maha. dist
to closest cluster
true
false
xvec queue
xvec queue
Create new
class
Classify as new cluster
Classify as closest cluster
Compute joint Maha. dists to closest cluster
Mahalanobis Classifier
Class
Class
Update
Covariance Adaptation
Experiment 5: 5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
x-vectors
joint Maha. dist
to closest cluster
true
false
xvec queue
xvec queue
Create new
class
Classify as new cluster
Classify as closest cluster
Compute joint Maha. dists to closest cluster
Mahalanobis Classifier
Class
Class
Update
xvec queue
true
false
Collect the
x-vectors
true
false
...
Collection
Train covariance
matrix on Collection
Trained Covariance
Experiment 5: 5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
x-vectors
joint Maha. dist
to closest cluster
true
false
xvec queue
xvec queue
Create new
class
Classify as new cluster
Classify as closest cluster
Compute joint Maha. dists to closest cluster
Mahalanobis Classifier
Class
Class
Update
Covariance Adaptation
Experiment 5: 5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
x-vectors
joint Maha. dist
to closest cluster
true
false
xvec queue
xvec queue
Create new
class
Classify as new cluster
Classify as closest cluster
Compute joint Maha. dists to closest cluster
Mahalanobis Classifier
Class
Class
Update
Covariance Adaptation
Experiment 5: 5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
Baseline
5s Cov. Adapt + Algorithmic Stats
Speaker Identification
Sensor Localization
Intro
System Infrastructure
Hungarian
Greedy
5s Cov. Adapt + Algorithmic Stats
Experiment 6: 5s Cov. Adapt + Algorithmic Mean
Diagram: the Mahalanobis Classifier flowchart with Covariance Adaptation and the Update step on the Class
Figure: Baseline vs. 5s Cov. Adapt + Algorithmic Mean, under Hungarian and Greedy matching
Maximum Coverage
Home Mapping
System Infrastructure
Detection of New Classes
Adaptive Few-Shot Speaker ID
Real-Time Platform
Brian Jones
Intel i7-4770 (4 cores / 8 threads @ 3.40 GHz, June 2013), 32 GB
[Diagram: real-time platform. Blocks: Recorder (x4); Speaker; a true/false decision; Register New Class/Speaker; Speaker = New Speaker; Speaker = Closest Cluster; Update speaker distr.; Front-End Dashboard.]
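The "Update speaker distr." block suggests that each speaker's statistics are refreshed as new embeddings arrive. One common way to do this without storing past samples is a Welford-style running mean and covariance. The sketch below assumes that approach (the slides do not say which update rule is used), and the class name `RunningGaussian` is hypothetical.

```python
import numpy as np


class RunningGaussian:
    """Running mean and covariance of a stream of vectors (Welford-style update)."""

    def __init__(self, dim: int) -> None:
        self.n = 0                        # number of samples seen so far
        self.mean = np.zeros(dim)         # running mean
        self._m2 = np.zeros((dim, dim))   # running sum of outer products of deviations

    def update(self, x: np.ndarray) -> None:
        """Fold one new sample into the statistics without storing past samples."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean             # deviation from the old mean
        self.mean += delta / self.n
        self._m2 += np.outer(delta, x - self.mean)  # second factor uses the new mean

    @property
    def cov(self) -> np.ndarray:
        """Unbiased sample covariance; needs at least two samples."""
        if self.n < 2:
            raise ValueError("covariance needs at least two samples")
        return self._m2 / (self.n - 1)


rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 3))
g = RunningGaussian(dim=3)
for s in samples:                         # one sample at a time, as in a live stream
    g.update(s)
```

After the loop, `g.mean` and `g.cov` should match the batch estimates computed from all samples at once, which is what makes this kind of update suitable for an online setting.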
[Diagram: Mahalanobis Classifier. Inputs: Audio → x-vector system → x-vectors. Blocks: compute joint Maha. dists to closest cluster; a true/false test on the joint Maha. dist to the closest cluster; Create new class; Classify as new cluster; Classify as closest cluster; Class (output); Update; Covariance Adaptation.]
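The labels above describe an open-set decision: score an x-vector against the existing clusters, then either assign it to the closest one or create a new class. A minimal sketch of that decision follows, assuming a squared Mahalanobis distance per cluster and a fixed threshold. The function names, the threshold value, and the toy 2-D embeddings are illustrative assumptions. The "joint" distance and the covariance adaptation from the slides are not reproduced here, and the branch direction (a large distance means a new class) is also an assumption.

```python
import numpy as np


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from a Gaussian (mean, cov)."""
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ y = diff instead of forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))


def classify_open_set(x, clusters, threshold):
    """Return (label, is_new).

    clusters maps an integer label to (mean, cov). If the closest cluster is
    farther than `threshold`, a new label is registered with a placeholder
    identity covariance (assumption: a real system would use a prior or an
    adapted covariance instead).
    """
    dists = {k: mahalanobis_sq(x, m, c) for k, (m, c) in clusters.items()}
    closest = min(dists, key=dists.get) if dists else None
    if closest is not None and dists[closest] <= threshold:
        return closest, False
    new_label = max(clusters, default=-1) + 1
    clusters[new_label] = (np.asarray(x, dtype=float), np.eye(len(x)))
    return new_label, True


clusters = {
    0: (np.array([0.0, 0.0]), np.eye(2)),
    1: (np.array([5.0, 5.0]), np.eye(2)),
}
threshold = 9.0  # hypothetical cut-off on the squared distance

known = classify_open_set(np.array([0.5, -0.2]), clusters, threshold)
novel = classify_open_set(np.array([-6.0, 7.0]), clusters, threshold)
```

Here the first point falls close to cluster 0 and is assigned to it, while the second is far from both clusters and is registered as a new class.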
[Summary diagram. Labels: Home Mapping; Maximum Coverage; Detection of New Classes; Adaptive Few-Shot Speaker ID; seen (x3); ? (x3); Mahalanobis Classifier (joint Maha. dists to closest cluster, T/F test, k = closest cluster, k = new cluster, new class, Cov Adapt); Speaker; Register New Class/Speaker; Speaker = New Speaker; Speaker = Closest Cluster; Update speaker distr.; Front-End Dashboard.]
Yash Kiarashi
Ratan Singh
Nasim Katebi
Giulia Da Poian
Sanmathi Kamath
Erick Perez
Jim Kinney
Pradyumna Suresha
Arjun Nakum
V. S. Krishna Madala
Chaitra Hegde
Ayse Cakmak
Robert Tweedy
Zifan Jiang
Clayton Feustel
Mohammad
Brandon Carroll
Devon Janke
Siuka Wong
Doug Chau
Brandon Lew
Dennis Delgado
Peiqi Yang
Sandy Wu
Hansol Choi
Sasha Keizs
Kate Lau
Angelica Quintana
Joe Small
Mike Mones
Best-Naz Eshaghi
Krishna Sanka
Uros Kuzmanovic
Mike Moxey
Luis Ortiz
Nicole Nowbahar
Jane Gong
Daniella Corporan
José Magalhães
Joel Corporan
Yash-Yee Logan
Sanghoon Lee
Luis Rosa
Nauman Ahad
Eric Qin
Will Sealy
Brett Ringel
Nathan Glaser
Norh Asmare
Harold Nikoue
Chris James Banks
Akash Patel
Moamen Soliman
Bogdan Vlahov
Adi Kambil
Ambuz Vimal
Mouhyemen Khan
William Gagstetter
Andy Fan
Sensor Localization
A few steps needed to be accomplished:
& Good Initial Conditions
Assumption: hallways with large cross-sections lead to larger rooms than hallways with small cross-sections.
System Infrastructure
1. Server Descriptor (server_info.json)
    {
        "PORT": 5050,
        "HEADER": 64,
        "FORMAT": "utf-8",
        "DISCONNECT_MSG": "DISCONNECT",
        "logging_format": "%(asctime)s - %(message)s",
        "logging_level": "info"
    }
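The `logging_format` and `logging_level` fields look like arguments for Python's standard `logging` module. As a sketch of how such a descriptor might be consumed (the helper `load_descriptor` and the key `logging_level_no` are hypothetical, and how `audiosockets` actually parses the file is not shown in the slides):

```python
import json
import logging

DESCRIPTOR_TEXT = """
{
  "PORT": 5050,
  "HEADER": 64,
  "FORMAT": "utf-8",
  "DISCONNECT_MSG": "DISCONNECT",
  "logging_format": "%(asctime)s - %(message)s",
  "logging_level": "info"
}
"""


def load_descriptor(text: str) -> dict:
    """Parse the descriptor and resolve the logging level name to its numeric value."""
    cfg = json.loads(text)
    # logging.INFO, logging.DEBUG, ... are module attributes named in upper case.
    cfg["logging_level_no"] = getattr(logging, cfg["logging_level"].upper())
    return cfg


cfg = load_descriptor(DESCRIPTOR_TEXT)
logging.basicConfig(format=cfg["logging_format"], level=cfg["logging_level_no"])
logging.info("descriptor loaded, port %d", cfg["PORT"])
```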
2. Start up a server
    from audiosockets import MailmanSocket
    mailman = MailmanSocket("server_info.json")
    mailman.start()
3. Start Recorder Client
    from audiosockets import RecorderSocket
    recorder = RecorderSocket("server_info.json")
    recorder.start()
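The `HEADER`, `FORMAT`, and `DISCONNECT_MSG` fields in the descriptor resemble a common length-prefixed messaging pattern, in which each message is preceded by a fixed-width header that states the payload length. Whether `audiosockets` uses exactly this framing is an assumption. The sketch below only illustrates the pattern with the standard library, and `send_msg`, `recv_exact`, and `recv_msg` are hypothetical helper names.

```python
import socket

HEADER = 64                 # header width in bytes (value from the descriptor)
FORMAT = "utf-8"
DISCONNECT_MSG = "DISCONNECT"


def send_msg(sock: socket.socket, text: str) -> None:
    """Send a fixed-width, space-padded length header followed by the payload."""
    payload = text.encode(FORMAT)
    header = str(len(payload)).encode(FORMAT).ljust(HEADER)
    sock.sendall(header + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock: socket.socket) -> str:
    """Read one framed message: header first, then the announced payload."""
    length = int(recv_exact(sock, HEADER).decode(FORMAT))
    return recv_exact(sock, length).decode(FORMAT)


# Demonstrate on an in-process socket pair instead of a real server.
client, server = socket.socketpair()
send_msg(client, "hello")
send_msg(client, DISCONNECT_MSG)
received = []
while True:
    msg = recv_msg(server)
    if msg == DISCONNECT_MSG:   # sentinel tells the receiver to stop
        break
    received.append(msg)
client.close()
server.close()
```

A fixed-width header lets the receiver know how many bytes to wait for, which matters on stream sockets where message boundaries are not preserved.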
4. Start a Processor
    from audiosockets import BaseProcessorSocket
    from audiosockets.utils import LogMelSpectrogram

    class LogMelSpecProcessor(BaseProcessorSocket):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def process_data(self, data):
            fs = data["fs"]        # sampling rate
            audio = data["data"]   # audio samples
            lms = LogMelSpectrogram(fs)(audio)
            print(lms.shape)

    processor = LogMelSpecProcessor("VAD", "server_info.json")
    processor.start()
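`LogMelSpectrogram` comes from `audiosockets.utils`, and its internals are not shown in the slides. For readers unfamiliar with the feature, the following NumPy sketch computes a log-mel spectrogram under common default choices (Hann window, 25 ms frames, 10 ms hop, 40 triangular mel filters). All parameter values and the function names are assumptions, not the library's implementation.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(fs, n_fft, n_mels):
    """Triangular filters with centers evenly spaced on the mel scale."""
    freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)      # FFT bin frequencies
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(audio, fs, frame_s=0.025, hop_s=0.010, n_mels=40):
    """Return an array of shape (n_mels, n_frames)."""
    frame, hop = int(round(frame_s * fs)), int(round(hop_s * fs))
    n_frames = 1 + (len(audio) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(frame)                 # windowed frames
    power = np.abs(np.fft.rfft(frames, n=frame, axis=1)) ** 2
    mel = mel_filterbank(fs, frame, n_mels) @ power.T
    return np.log(mel + 1e-10)                              # small floor avoids log(0)


fs = 16000
t = np.arange(fs) / fs                                      # one second of audio
lms = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), fs)
print(lms.shape)
```

Inside a processor such as the one above, a function like this would be called once per incoming audio chunk, so the frame and hop sizes determine how many feature columns each chunk yields.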
Speaker Identification
[Figure: evaluation setup. Labels: speakers with > 41 mins; data split 80% / 10% / 10%; seen (x3); unseen (x2); ? (x3); Stats for Gaussians; 29 Speakers.]
What happens if we vary the number of speakers enrolled?
Illustration (to be created): covariances growing, shown as ellipses expanding along different eigenvectors.
Brighter colors indicate later stages.